Enhancing low resource keyword spotting with automatically retrieved web documents
نویسندگان
چکیده
Keyword Spotting (KWS) systems developed for low resource languages with very little transcribed audio suffer due to a small vocabulary (high out-of-vocabulary (OOV) rate) and a weak language model. In this paper, we propose to augment such systems using automatically retrieved web documents. Our procedure can find large volumes of web documents similar to a small pool of training transcriptions within a few hours, by querying a search engine with automatically generated query terms. We then use simple language identification to extract high-confidence text for lexicon expansion and language modeling. Experiments using six very limited language packs (VLLP) from the IARPA-Babel program show web documents can cut the OOV rate by half on the development set, and on average improve keyword spotting performance by 2.8 points absolute measured by the Actual Term Weighted Value (ATWV). In particular, we find most of the gains (8.7 points on average) are from keywords that were OOV in the baseline system, and are converted into in-vocabulary (IV) through lexicon expansion. These gains are obtained even after using subword units (unsupervised syllable-like units and sequences of phones), which are known to greatly enhance OOV keyword search performance.
منابع مشابه
A Personalized Reordering System of Searched Web Pages
The commercial engines retrieved the appropriate documents by user’s keyword. Therefore, they reduce user’s effort and time to find appropriate information. However, the web page list retrieved from the engines is still not enough for user effort free. It is because the search target for each user is usually different according to user profile. Thus, it is need to reorder web page list accordin...
متن کاملA Survey on Various Word Spotting Techniques for Content Based Document Image Retrieval
Searching documents for information and retrieval of relevant documents is a basic activity. Various tools are readily available for searching and retrieval from digital documents, but not much robust methods are available for retrieval from historic documents and old manuscripts as they are not digitized but available in scanned formats. Conventional way of retrieval from scanned document imag...
متن کاملKeyword spotting in unconstrained handwritten Chinese documents using contextual word model
a r t i c l e i n f o Keywords: Keyword spotting Chinese handwritten documents Word similarity Contextual word model This paper proposes a method for keyword spotting in off-line Chinese handwritten documents using a contextual word model, which measures the similarity between the query word and every candidate word in the document by combining a character classifier and the geometric context a...
متن کاملLeveraging web resources for keyword assignment to short text documents
Assigning relevant keywords to documents is very important for efficient retrieval, clustering and management of the documents. Especially with the web cor-pus deluged with digital documents, automation of this task is of prime importance. Keyword assignment is a broad topic of research which refers to tagging of document with keywords, key-phrases or topics. For text documents, the keyword ass...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2015